Unicode character property

Unicode assigns character properties to each code point.[1] These properties are used to handle "characters" (code points) in processes such as line breaking, right-to-left script direction, or applying controls. Somewhat inconsistently, some "character properties" are also defined for code points that have no character assigned, and for code points that are labeled "<not a character>".

Properties have levels of forcefulness: normative, informative, contributory, or provisional. For practical reasons, a character property can be assigned by specifying a contiguous range of code points that share the same property value.

Character property

Name

Unicode characters are assigned a unique Name (na).[1] The name, in English, is composed of the capital letters A-Z, the digits 0-9, - (hyphen-minus) and <space>. Some sequences are not allowed: a leading or trailing space or hyphen, repeated spaces or hyphens, and a space after a hyphen. The name is guaranteed to be unique within Unicode, and can be used to identify a code point and its character. Ideographic characters, of which there are tens of thousands, are named in the pattern "CJK UNIFIED IDEOGRAPH-hhhh", for example U+4E00 CJK UNIFIED IDEOGRAPH-4E00. Formatting characters are named too: U+00A0 NO-BREAK SPACE.

Starting with Unicode version 2.0, the published name for a code point never changes. If a published name contains a misspelling, a corrected name is later assigned to the code point as a Character Name Alias. Within the whole range of names, an alias is unique too.

Apart from these normative names, informal names can be assigned. These are usually other commonly used names for a character, used for illustration, but these informal names are not guaranteed to be unique.

The following code points do not have a Name (na=""): Controls (General Category: Cc), Private use (Co), Surrogate (Cs), Non-characters (Cn) and Reserved (Cn). They may be referenced, informally, by a generic or specific meta-name, called a "Code Point Label": <control>, <control-0088>, <reserved>, <noncharacter-hhhh>, <private-use-hhhh>, <surrogate>. Since these labels contain <>-brackets, they can never appear as a Name, which prevents confusion.
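The lookup described above can be tried with Python's standard unicodedata module. This is a minimal sketch: the helper describe() and the exact label format it builds are assumptions made for illustration, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the Name of a character, or a label-style placeholder if it has none."""
    # unicodedata.name() raises ValueError for nameless code points unless a
    # default is supplied; the empty string mirrors na="".
    name = unicodedata.name(ch, "")
    if name:
        return name
    # Hypothetical fallback: derive a label from the General Category.
    kinds = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}
    kind = kinds.get(unicodedata.category(ch), "unnamed")
    return f"<{kind}-{ord(ch):04X}>"


print(describe("\u4e00"))  # an ideograph, named by pattern
print(describe("\u00a0"))  # a formatting character with a Name
print(describe("\u0088"))  # a control, which has no Name
# A Name identifies exactly one code point, so the reverse lookup is unambiguous.
print(unicodedata.lookup("NO-BREAK SPACE") == "\u00a0")
```

Because the placeholder contains <>-brackets, it cannot collide with a real Name returned by the first branch.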

Version 1.0 names

In version 2.0 of Unicode, many names were changed. From then on the rule "a name will never change" came into effect, including the strict (normative) use of alias names. Disused version 1.0 names were moved to the property Alias, to provide some backward compatibility.

General Category

Each code point is assigned a value for General Category. This is one of the character properties that are also defined for unassigned code points, and for code points that are defined as "not a character".
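As a small illustration, Python's unicodedata.category() returns the two-letter General Category abbreviation for any code point, including code points that are not characters. The helper major_class() and its table of class names are assumptions added for this sketch.

```python
import unicodedata

# Hypothetical helper table: first letter of the abbreviation -> major class.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def major_class(ch: str) -> str:
    """Map a character to the major class of its General Category."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]


for ch in ["A", "1", " ", "\u0009", "\ue000", "\uffff"]:
    # U+FFFF is a noncharacter, yet it still has a General Category value.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), major_class(ch))
```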

Notes

  1. Unicode 6.0, Chapter 4, table 4-9
  2. Unicode 6.0, Chapter 2, table 2-3: Types of code points
  3. Stability policy: Property Value Stability and table. Stability policy: Some gc groups will never change. gc=Nd corresponds with Numeric Type=De (decimal).
  4. Unicode 6.0, Chapter 4, table 4-12 Name=""; a Code Point Label may be used to identify a nameless code point. E.g. <control-hhhh>, <control-0088>. The Name remains blank, which can prevent inadvertently replacing, in documentation, a Control Name with a true Control code. Unicode also uses <not a character> for <noncharacter>.

Whitespace

Whitespace, or whitespace character, is a commonly used concept for a typographic effect. It covers invisible characters that have a spacing effect in rendered text, including spaces, tabs, and new-line formatting controls. In Unicode, such a character has the property "WSpace=yes". In version 6.0, there are 26 whitespace characters.

Whitespace[a] (Unicode character property WSpace=Y)

Code point  Name                       Script     General category      Remark
U+0009                                 Common     Other, control        HT, Horizontal Tab
U+000A                                 Common     Other, control        LF, Line feed
U+000B                                 Common     Other, control        VT, Vertical Tab
U+000C                                 Common     Other, control        FF, Form feed
U+000D                                 Common     Other, control        CR, Carriage return
U+0020      SPACE                      Common     Separator, space
U+0085                                 Common     Other, control        NEL, Next line
U+00A0      NO-BREAK SPACE             Common     Separator, space
U+1680      OGHAM SPACE MARK           Ogham      Separator, space
U+180E      MONGOLIAN VOWEL SEPARATOR  Mongolian  Separator, space
U+2000      EN QUAD                    Common     Separator, space
U+2001      EM QUAD                    Common     Separator, space
U+2002      EN SPACE                   Common     Separator, space
U+2003      EM SPACE                   Common     Separator, space
U+2004      THREE-PER-EM SPACE         Common     Separator, space
U+2005      FOUR-PER-EM SPACE          Common     Separator, space
U+2006      SIX-PER-EM SPACE           Common     Separator, space
U+2007      FIGURE SPACE               Common     Separator, space
U+2008      PUNCTUATION SPACE          Common     Separator, space
U+2009      THIN SPACE                 Common     Separator, space
U+200A      HAIR SPACE                 Common     Separator, space
U+2028      LINE SEPARATOR             Common     Separator, line
U+2029      PARAGRAPH SEPARATOR        Common     Separator, paragraph
U+202F      NARROW NO-BREAK SPACE      Common     Separator, space
U+205F      MEDIUM MATHEMATICAL SPACE  Common     Separator, space
U+3000      IDEOGRAPHIC SPACE          Common     Separator, space

a. Unicode 6.0, Chapter 4.6
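The list above can be turned into a simple membership test. The sketch below hard-codes the 26 code points from the table (Unicode 6.0) instead of relying on str.isspace(), whose notion of whitespace may differ; the names WHITE_SPACE_6_0, is_white_space and split_on_white_space are assumptions made for illustration.

```python
# Code points copied from the table above (White_Space as of Unicode 6.0).
WHITE_SPACE_6_0 = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
     *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)


def is_white_space(ch: str) -> bool:
    """True if the character is one of the listed whitespace code points."""
    return ord(ch) in WHITE_SPACE_6_0


def split_on_white_space(text: str) -> list[str]:
    """Split text into runs of non-whitespace characters."""
    words, current = [], []
    for ch in text:
        if is_white_space(ch):
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(len(WHITE_SPACE_6_0))
print(split_on_white_space("a\u00a0b\u3000c d"))
```

Pinning the set to one version keeps the result stable even if a later Unicode version changes the property for some code point.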

Other important general characteristics

(dash, ideographic, alphabetic, noncharacter, deprecated, and so on)

Display-related properties

Shaping, mirroring, width, and so on.

Bidirectional writing

One of Unicode's major features is support for bidirectional (Bidi) text display, right-to-left (R-to-L) and left-to-right (L-to-R). The Unicode Bidirectional Algorithm, UAX 9,[6] describes the process of presenting text with alternating script directions. For example, it enables a Hebrew quote in an English text. To facilitate this feature, Unicode has defined seven special Bidi formatting control characters (LRM, LRE, LRO, RLM, RLE, RLO, PDF). These characters can enforce a direction, and by definition affect only bidirectional writing.

Each code point has a property called Bidirectional Character Type, formally Bidi_Class. It defines the code point's behaviour in bidirectional text as interpreted by the algorithm. There are 19 possible types.

In normal situations, the algorithm can determine the direction of a text by this character property. To control more complex Bidi situations, e.g. when an English text has a Hebrew quote, extra options are added to Unicode. Seven characters have the property Bidi_Control=Yes: LRM, RLM, LRE, RLE, PDF, LRO, RLO as named in the table. These are invisible formatting control characters, only used by the algorithm and with no effect outside of bidirectional formatting.[6] Despite the name, they are formatting characters, not control characters, and have General category "Other, format (Cf)" in the Unicode definition.

Basically, the algorithm determines sequences of characters with the same strong direction type (R-to-L or L-to-R), taking into account any override by the special Bidi controls. Number strings (Weak types) are assigned a direction according to their strong environment, as are Neutral characters. Finally, the characters are displayed according to each string's direction.

Two other character properties are relevant to bidirectional text. Bidi_Mirrored=Yes indicates that the glyph should be mirrored when written R-to-L. The property Bidi_Mirroring_Glyph=U+hhhh can then point to the mirrored character. For example, brackets "()" are mirrored this way. Shaping cursive scripts such as Arabic, and mirroring glyphs that have a direction, are not part of the algorithm.
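The properties discussed in this section can be inspected with Python's unicodedata module. The sketch below is not the Unicode Bidirectional Algorithm: strong_direction() is a hypothetical helper that only reports the direction of the first strong character, ignoring weak types, neutrals and the Bidi controls.

```python
import unicodedata


def strong_direction(text: str):
    """Return 'ltr' or 'rtl' for the first strong character, or None if there is none."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # Only weak or neutral characters were found (digits, spaces, punctuation).
    return None


print(strong_direction("123 abc"))           # digits are weak, so the letters decide
print(strong_direction("\u05d0\u05d1 abc"))  # starts with Hebrew letters
print(unicodedata.bidirectional("1"))        # Bidi_Class of a European digit
print(unicodedata.mirrored("("))             # 1 when Bidi_Mirrored=Yes
print(unicodedata.category("\u200f"))        # RLM is a format character
```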

Casing

The Case value is normative in Unicode. It pertains to scripts that have uppercase (also known as capital or majuscule) and lowercase (also known as small or minuscule) letters. Case difference occurs in the Latin, Greek, Coptic, Cyrillic, Glagolitic, Armenian, Deseret, and archaic Georgian scripts.

(upper, lower, title, folding; both simple and full)
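As a rough illustration of these mappings, Python's built-in string methods apply upper, lower, title and fold conversions. The helper caseless_equal() is an assumption made for this sketch; a more careful comparison would also normalize the strings first.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using case folding."""
    return a.casefold() == b.casefold()


# A full case mapping may change the length of the string.
print("straße".upper())
print("straße".casefold())

# U+01C6 (a digraph letter) has distinct upper, lower and title forms.
small = "\u01c6"
print([f"U+{ord(c):04X}" for c in (small.upper(), small.lower(), small.title())])

print(caseless_equal("Straße", "STRASSE"))
```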

Numeric values and types

Characters are classified with a Numeric type.[1] Numeric characters include fractions, subscripts, superscripts, Roman numerals, currency numerators, encircled numbers, and script-specific digits. All of these have a numeric value that can be a decimal number, including zero and negatives, or a vulgar fraction. If there is no such value, as with most characters, the numeric type is "None".

The numeric characters are separated into three groups: Decimal (De), Digit (Di) and Numeric (Nu, i.e. all other). "Decimal" means the character is a straight decimal digit. "Digit" covers digits that need special handling, such as superscript digits. Fractions, Roman numerals and the like end up with the type "Numeric". The intended effect is that a simple parser can use the decimal numeric values without being distracted by, say, a numeric superscript or a fraction. CJK ideographs that represent a number, including those used for accounting, are typed "Numeric".

On the other hand, characters that could have a numeric value as a second meaning are still marked Numeric type "None", and have no numeric value (""). E.g. Latin letters can be used in paragraph numbering like (II.A.1.b), but the letters "I", "A" and "b" are not numeric (type "None") and have no numeric value.

Numeric Type[a] (Unicode character property)

Value  Numeric type   Is numeric  Remarks
None   Not numeric    No          No numeric value
De     Decimal digit  Yes         Straight digit (decimal-radix). Corresponds both ways with General Category=Nd[b]
Di     Digit          Yes         Digit that needs special handling, e.g. superscript digit
Nu     Numeric        Yes         All other: fraction, Roman numeral, CJK ideograph number

a. Unicode 6.0, Chapter 4.6
b. Property Value Stability, in Stability policy.
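Python's unicodedata module offers three lookups, decimal(), digit() and numeric(), each of which raises ValueError when the character has no value of that kind. The sketch below uses them to approximate the type of a character; the function numeric_type() and the assumption that the three lookups line up with the values De, Di and Nu are made for illustration only.

```python
import unicodedata


def numeric_type(ch: str) -> str:
    """Approximate the Numeric Type of a character: 'De', 'Di', 'Nu' or 'None'."""
    # Try the narrowest lookup first; each one raises ValueError on failure.
    lookups = (
        ("De", unicodedata.decimal),
        ("Di", unicodedata.digit),
        ("Nu", unicodedata.numeric),
    )
    for label, lookup in lookups:
        try:
            lookup(ch)
        except ValueError:
            continue
        return label
    return "None"


for ch in ["7", "\u0663", "\u00b2", "\u00bd", "\u2167", "A"]:
    value = unicodedata.numeric(ch, None)
    print(f"U+{ord(ch):04X}", numeric_type(ch), value)
```

The letter "A" comes out as "None" even though it can be used in paragraph numbering, matching the point made above about secondary numeric meanings.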

Block

A block is a named, contiguous range of code points. It is identified by its first and last code point. It may contain code points that are reserved, unassigned, and so on. Each assigned character has a single "block name" value, from the 209 names defined as of Unicode 6.0. Unassigned code points outside of an existing block have the default value "No_Block".
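Since a block is just a range with a name, a lookup can be sketched as a binary search over sorted ranges. Python's standard library has no block lookup, so the table below is a hand-copied excerpt of five blocks, and the names BLOCKS and block_of are assumptions made for illustration.

```python
from bisect import bisect_right

# Excerpt only: (first code point, last code point, block name), sorted by start.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x2100, 0x214F, "Letterlike Symbols"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]
_STARTS = [first for first, _, _ in BLOCKS]


def block_of(ch: str) -> str:
    """Return the block name for a character, or the default value."""
    cp = ord(ch)
    # Find the last block whose first code point is <= cp.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0:
        first, last, name = BLOCKS[i]
        if first <= cp <= last:
            return name
    # With this excerpt, blocks that were left out also end up here.
    return "No_Block"


for ch in ["A", "\u05d0", "\u2126", "\u4e00"]:
    print(f"U+{ord(ch):04X}", block_of(ch))
```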

Unicode blocks and contained scripts

Notes

  1. Unicode Blocks data file. As of Unicode version 6.0
  2. UAX 24: Unicode Script Property (4alpha code)
  3. UAX 24: Script data file
  4. Including unassigned code points: non-character, reserved
  5. The script has one or multiple characters in the block, as defined by the Script Property. This is independent of the block name
  6. "Common" (Zyyy) and "Inherited" (Qaai or Zinh) refer to Scripts in ISO 15924

Script

Each assigned character has a single value for its "Script" property, signifying the script to which it belongs.[13] The value is a four-letter code in the range Aaaa-Zzzz, as available in ISO 15924, which is mapped to a writing system. Apart from describing the background and usage of a script, Unicode does not connect a script with the languages that use it. So "Hebrew" refers to the Hebrew script, not to the Hebrew language.

The special code Zyyy for "Common" allows a single value for a character that is used in multiple scripts. The code Zinh "Inherited script", used for combining characters and certain other special-purpose code points, indicates that a character "inherits" its script identity from the character with which it is combined. (Unicode formerly used the private code Qaai for this purpose.) The code Zzzz "Unknown" is the default value, used for code points that have not been given a script, such as unassigned, private-use, and surrogate code points. Overall, characters of a single script can be scattered over multiple blocks, like Latin characters. The reverse also holds: multiple scripts can be present in a single block, even when the block name suggests otherwise. For example, the block Letterlike Symbols contains characters from the Latin, Greek and Common scripts.

Symbols are not given a script of their own, because the existing ISO 15924 codes "Zmth" (Mathematical notation) and "Zsym" (Symbol) are not used in Unicode. Symbols, like controls, have the value "Common" instead.

If there is a specific script alias name in ISO 15924, it is used in the character name: U+0041 A LATIN CAPITAL LETTER A, and U+05D0 א HEBREW LETTER ALEF.
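The naming convention just described can be used for a rough guess at a character's script. This is only a heuristic and not the Script property itself, which Python's standard library does not expose; the function script_hint_from_name() and the short list KNOWN_SCRIPT_WORDS are assumptions made for illustration.

```python
import unicodedata

# Hypothetical excerpt of script names that appear as the first word of a Name.
KNOWN_SCRIPT_WORDS = {"LATIN", "GREEK", "CYRILLIC", "HEBREW", "ARABIC"}


def script_hint_from_name(ch: str):
    """Guess a script from the first word of the character Name, or return None."""
    name = unicodedata.name(ch, "")
    first_word = name.split(" ")[0] if name else ""
    if first_word in KNOWN_SCRIPT_WORDS:
        return first_word.capitalize()
    # No recognizable script word: symbols, digits, nameless code points.
    return None


for ch in ["A", "\u05d0", "\u03b1", "+", "\u2135"]:
    print(f"U+{ord(ch):04X}", script_hint_from_name(ch))
```

U+2135 ALEF SYMBOL shows the limit of the heuristic: its Name does not start with a script word, so no hint is returned.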

Normalization properties

(decompositions, decomposition type, canonical combining class, composition exclusions, and so on)
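The properties listed here can be observed through Python's unicodedata module. The following is a minimal sketch using a few sample characters chosen for illustration; results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

e_acute = "\u00e9"

# Canonical decomposition mapping, given as space-separated code points.
print(unicodedata.decomposition(e_acute))

# A compatibility decomposition carries a decomposition type in <> brackets.
print(unicodedata.decomposition("\u00a0"))

# Canonical combining class of COMBINING ACUTE ACCENT (0 means "not reordered").
print(unicodedata.combining("\u0301"))

# NFD decomposes, NFC recomposes.
decomposed = unicodedata.normalize("NFD", e_acute)
print([f"U+{ord(c):04X}" for c in decomposed])
print(unicodedata.normalize("NFC", decomposed) == e_acute)

# A composition exclusion: U+0958 stays decomposed even under NFC.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", "\u0958")])
```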

Age

"Age" is the version of the Standard in which the code point was first designated. The version number is shortened to the form major.minor, although more detailed version numbers are used elsewhere: versions 4.0.0 and 4.0.1 both have Age 4.0. Given the releases, Age can be one of: 1.0, 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0, 4.1, 5.0, 5.1, 5.2 and 6.0.[14][15]

Boundaries

(grapheme cluster, word, line, and sentence)

References

  1. Unicode 6.0, Chapter 4
  2. UAX 9, Standard Annex "Unicode Bidirectional Algorithm"
  3. Unicode Standard Annex #24: Unicode Script Property
  4. Pre version 4
  5. Versions 4.0 and later